Loving you is complicated: Quantifying musical sentiment and success through data science

Spotify has quickly become the most popular music streaming service in the world, with over 271 million monthly active users. A close collaborator of Spotify is Genius, a website with 26.5 million monthly users that allows members of its community to upload, annotate, and interpret lyrics from music artists. Inspired by a blog post from Thompson Analytics, I will use the Spotify API, the Genius API and the NRC Emotion Lexicon to quantify both the musical sentiment and lyrical sentiment of one of the most prominent artists of our time: Kendrick Lamar. A visualization of his discography will accompany this process of data collection. In addition to this, I will gather data on the 100 most streamed songs of all time and will use different regression techniques to determine how useful these features are for predicting musical success.

PART I: Collecting Kendrick Lamar's Discography.

Spotify features

The Spotify API assigns musical features to every track on their platform. These features are defined in the following way:

I will be using these features for visualization and analytical purposes in the first and second parts, respectively.

Getting the Data from Spotify

In order to get the data from Spotify I will be using Spotipy, a lightweight Python library for the Spotify Web API:

I will be using Plotly for most visualization tasks since I love how crisp it looks. For the second part of this project I will rely on Scikit-learn to compute the different prediction models. In addition to this, I need to set the credentials obtained from my Spotify API application to access their data.
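For reference, a minimal credentials setup might look like the sketch below. The environment-variable names are my own assumption about where the keys live, and the actual client construction is left in comments since it needs a registered Spotify application:

```python
import os

# Hypothetical environment variables holding the Spotify API keys.
SPOTIFY_CLIENT_ID = os.environ.get("SPOTIPY_CLIENT_ID", "")
SPOTIFY_CLIENT_SECRET = os.environ.get("SPOTIPY_CLIENT_SECRET", "")

# With the keys in place, the Spotipy client is built like this
# (commented out here because it requires valid credentials):
# import spotipy
# from spotipy.oauth2 import SpotifyClientCredentials
# sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
#     client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET))
```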

Next, I define a series of functions that will allow me to gather data from the Spotify API and put it into a dataframe:
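As a rough sketch of what such functions do, the helper below merges track metadata with audio features into one row per track. The shape of the input dicts mirrors the Spotify API payloads, but the function name and the metadata columns kept are my own choices:

```python
import pandas as pd

def tracks_to_dataframe(tracks, features):
    """Combine Spotify track metadata and audio features into one
    dataframe, joined on the track id."""
    meta = pd.DataFrame(
        [{"id": t["id"], "name": t["name"], "album": t["album"]["name"]}
         for t in tracks]
    )
    feats = pd.DataFrame(features)  # one dict of audio features per track
    return meta.merge(feats, on="id")
```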

Done! Let's see the functions in action.

The API tends to be quite unreliable: at times the process might yield an error, so the previous step might require more than one attempt. Anyway, I got the data on the second try. Let's see how it looks.

Looking good. However, notice that we still have some repeated observations because of the deluxe edition of good kid, m.A.A.d city and the collector's edition of DAMN.. I will keep the former and drop the latter for popularity reasons. Moreover, I will drop the Black Panther soundtrack since I believe it does not stand as a singular effort by Kendrick, but rather as a collective effort meant to accompany the movie.
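The filtering step can be sketched as below on a toy frame; the album titles used for matching are illustrative, not the precise strings the API returns:

```python
import pandas as pd

# Stand-in for the discography dataframe built above.
df = pd.DataFrame({
    "album": ["DAMN.", "DAMN. COLLECTORS EDITION.",
              "good kid, m.A.A.d city (Deluxe)"],
    "name": ["HUMBLE.", "HUMBLE.", "Money Trees"],
})

# Drop the duplicate edition, keeping the more popular release.
drop_albums = {"DAMN. COLLECTORS EDITION."}
df = df[~df["album"].isin(drop_albums)].reset_index(drop=True)
```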

Getting the data from Genius

I will be using the LyricsGenius package that provides a simple interface to the song, artist, and lyrics data stored on Genius.

As before, we set the credentials obtained from my Genius app.

My plan is to iterate through my dataframe looking for every song name and creating a list with the lyrics. I decided against applying a function to the dataframe since the Genius API is very unreliable (it likes to throw errors at random: at times it works, at times it just does not!). Although this approach is lengthier, it is much more reliable. Before doing any of this, I will need to do some name standardization between my dataframe and the Genius API (titles that include the word featuring are problematic). Please forgive the profanity in the following cell.
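A title-cleaning helper along these lines strips the featuring credits before querying Genius; the exact patterns are assumptions about which suffixes cause mismatches:

```python
import re

def clean_title(title):
    """Remove featuring credits and trailing edition tags so Spotify
    titles line up with the names Genius uses."""
    title = re.sub(r"\s*[\(\[]feat\..*?[\)\]]", "", title, flags=re.IGNORECASE)
    title = re.sub(r"\s*-\s*(bonus track|live)$", "", title, flags=re.IGNORECASE)
    return title.strip()
```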

We are good to go! Let's see how the loop performs.

A quick inspection of the list of lyrics reveals that we got a few None values. Let's fix those:
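One way to patch the gaps is a retry pass like the sketch below, where `fetch` is a hypothetical wrapper around the LyricsGenius search for song `i`:

```python
def fill_missing(lyrics, fetch, max_retries=3):
    """Re-query every None entry up to max_retries times."""
    for i in range(len(lyrics)):
        attempts = 0
        while lyrics[i] is None and attempts < max_retries:
            lyrics[i] = fetch(i)  # may still return None on a bad call
            attempts += 1
    return lyrics
```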

We are ready to extract the lyrics and append them into the dataframe.

Quantifying Lyrical Sentiment.

In order to quantify lyrical sentiment I will use the NRC Emotion Lexicon. This dataset assigns different emotions or sentiments to English words. For example, the word abandon is mapped to the emotions of fear and sadness. My strategy is to compute the proportion of words associated with each sentiment. I recognize that this is a very limited approach to lyric analysis, but it feels like a good starting point given the current constraints.

Let's create a function that, given a song's lyrics and an emotion, computes the number of words in the song associated with that emotion.

Next, we create a function that counts the words associated with each emotion in a song and computes their proportions.
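Both helpers can be sketched as follows; the three-word lexicon is a hand-made stand-in for the real NRC file, which maps thousands of English words to ten sentiment categories:

```python
# Tiny illustrative subset of the NRC Emotion Lexicon.
NRC_SUBSET = {
    "abandon": {"fear", "sadness", "negative"},
    "love": {"joy", "positive"},
    "money": {"anticipation", "positive"},
}
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy",
            "sadness", "surprise", "trust", "negative", "positive"]

def emotion_count(lyrics, emotion, lexicon=NRC_SUBSET):
    """Number of words in the lyrics tagged with the given emotion."""
    return sum(1 for w in lyrics.lower().split() if emotion in lexicon.get(w, ()))

def emotion_proportions(lyrics, lexicon=NRC_SUBSET):
    """Share of each emotion among all emotion-tagged word hits."""
    counts = {e: emotion_count(lyrics, e, lexicon) for e in EMOTIONS}
    total = sum(counts.values())
    return {e: (c / total if total else 0.0) for e, c in counts.items()}
```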

Finally, we append these proportions for each song in the dataframe:

As desired. We are ready to begin visualizing Kendrick's discography!

Visualization

First I wanted to see what type of lyrical and musical features dominate - on average - in each Kendrick album. To do so, I decided to produce an interactive radar plot with all albums overlaid on top of each other. This plot might look messy at first, but one can select (or deselect) each album by clicking on its legend entry or plot. In this way, we have a clear yet compact visualization.

First let's visualize the musical features of each of his albums.
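The quantities behind each radar trace are simply per-album feature means, which can be computed with a groupby; the toy numbers below only stand in for the real discography frame:

```python
import pandas as pd

# Stand-in for the track-level dataframe with Spotify audio features.
df = pd.DataFrame({
    "album": ["DAMN.", "DAMN.", "To Pimp a Butterfly"],
    "energy": [0.6, 0.8, 0.5],
    "valence": [0.3, 0.5, 0.4],
})

# One row per album; each row then becomes one trace on the radar plot
# (e.g. a plotly.graph_objects.Scatterpolar trace).
album_means = df.groupby("album")[["energy", "valence"]].mean()
```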

Still, while this visualization is effective at comparing which features dominate in one particular album, it is not as effective for comparing one feature across his discography. For such purposes it is better to create a bar chart:

Overall, I would argue that these visualizations adequately represent Kendrick's music. For example, untitled unmastered stands out for its live instrumentation and somewhat subdued tone. On the other hand, Kendrick is known for his intertwined storytelling: To Pimp a Butterfly references good kid, m.A.A.d city, while DAMN. references its two predecessors. It is to be expected that his lyrics have pretty similar features across his discography. One final point of interest is the predominance of both negative and positive lyrics. While seemingly contradictory, duality has been a staple of Kendrick's music. For example, consider u and i from To Pimp a Butterfly: members of the Genius community have noticed the duality between the two tracks: "u acts as a complete contrast to its lead single i, an anthem of peace, positivity, and prosperity starting with self-love".

I hope this first part of the project incentivizes the reader to explore the discography of their favorite artist!

PART II: Predicting musical success.

While it was fun to look at Kendrick's discography, such data will not allow me to answer the main question that drives this research: what are the determinants of popularity in music? This is because the Spotify API does not let us count the number of times a song has been streamed. The closest feature to this would be a track's popularity; however, there are two issues: the algorithm that drives this variable is not known, and the variable appears to be a better indicator of current popularity - i.e., what is trending - rather than all-time success. Thus I decided to use a dataset from Wikipedia that contains information about the top 100 most streamed songs of all time. I will once again use similar methods to obtain both musical and lyrical features for each song.

First I standardize the data so that we can find each track's musical features on Spotify. I write a few functions that will help me with that task.

We are ready to standardize the data.

Now, we obtain the IDs for every track.

Finally, we create the dataframe.

We have the musical features with the number of streams. Now, let's add the lyrics! However, before adding them it is better to standardize the naming of songs across the current database and the Genius API since the wrapper for the latter is unreliable. This way, we minimize the chance of error.

There are still some errors despite the standardization. We solve these by detecting the null values and replacing them with the correct values.

Next, we append the lyrics into the dataframe and operate to get its emotional features.

Finally, we use the NRC Lexicon to obtain the lyrical features of each song:

Next, we transform the release date into a date type and obtain the number of days since release as a measure of the song's longevity. First, we create a few helper functions.
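A minimal version of the longevity helper, assuming Spotify returns a full 'YYYY-MM-DD' release date (it can also return just a year, which would need extra handling):

```python
from datetime import date

def days_since_release(release_date, today=None):
    """Days elapsed since a 'YYYY-MM-DD' release date string."""
    today = today or date.today()
    y, m, d = (int(part) for part in release_date.split("-"))
    return (today - date(y, m, d)).days
```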

Finally, we apply some transformations on a few variables in order to minimize the condition number of the linear regression.
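As an illustration, log-scaling the widest-ranging columns is one such transformation; which columns dominate the condition number (here, streams and days since release) is my own assumption:

```python
import numpy as np
import pandas as pd

# Stand-in values for two wide-ranging regressors.
df = pd.DataFrame({"streams": [2.0e9, 1.5e9],
                   "days_since_release": [2000, 1500]})

# Log transforms compress the scale so no single column dominates
# the design matrix and inflates the condition number.
df["log_streams"] = np.log(df["streams"])
df["log_days"] = np.log(df["days_since_release"])
```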

We are ready to go! Let's visualize the relationship between the musical and lyrical features of a song and its popularity.

Let's try to analyze lyrical features next!

The model seems to perform reasonably well. Still, I am a bit worried that some variables are not bringing anything to the table. I will try to refine the model in the next section.

Refining the model

Let's implement a Lasso regression and compare its MSE to our previous linear model.
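The comparison follows this pattern, shown here on synthetic data. Note the role of `alpha`: when it is too large, the Lasso penalty shrinks every coefficient to zero, which foreshadows the behaviour observed later in this section:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error

# Synthetic data with two informative and three irrelevant regressors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)  # a large alpha pushes coefficients to 0

mse_ols = mean_squared_error(y, ols.predict(X))
mse_lasso = mean_squared_error(y, lasso.predict(X))
```

In-sample, ordinary least squares always attains the lowest MSE among linear predictors, so `mse_ols <= mse_lasso` here by construction; the interesting comparison is out of sample.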

Interestingly, it seems that our linear model manages to outperform our Lasso models. Let's visualize their different predictions as before.

Something fishy is going on with the Lasso regression... it looks like a horizontal line! Let's extract the coefficients of both models and compare them.

Our suspicions are confirmed: all Lasso coefficients are 0. That's a pretty strange result. It means that our Lasso regression reduces to the simplest of estimators: the sample mean of the dependent variable. Let's split the data into training and testing sets and compare the MSEs to check whether our linear model still beats the Lasso model.
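The split-and-compare logic looks like this sketch on synthetic data; the naive estimator simply predicts the training mean everywhere:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = rng.normal(size=100)  # pure noise: the regressors carry no signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
mse_ols = mean_squared_error(y_te, ols.predict(X_te))

# Naive estimator: predict the training-sample mean for every test point.
mse_naive = mean_squared_error(y_te, np.full(len(y_te), y_tr.mean()))
```

When the regressors are uninformative, as in this synthetic example, the naive mean tends to win out of sample, since any fitted slope only adds variance.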

Even though the linear model performs better during training, Lasso (and hence the naive estimator) beats our linear model on the test set. That is quite worrying, since it means that our variables have very limited predictive capabilities. I will try one last model: a neural network, as implemented in the lecture notes. Let's see if we can beat both other models. First, I use it on the whole dataset.
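A scikit-learn version of such a network is sketched below; the architecture and hyperparameters are my own assumptions, not the exact configuration from the lecture notes:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # a nonlinear target

# Small feed-forward network with two hidden layers.
nn = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
nn.fit(X, y)
preds = nn.predict(X)
```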

Cool! This model seems to outperform both previous models by a wide margin. Let's visualize the predictions.

Finally, let's split the dataset into training and testing parts and hope that our NN model beats the naive estimator.

Sadly, none of our models were able to beat the naive estimator in the testing phase.

Concluding remarks

I would like to continue refining these models, as there is most likely an issue of overspecification. However, the project is already quite long, so I feel this is a good place to end it. Visually, it seems to me that musical features are not very good indicators of popularity, while lyrical features have stronger predictive capabilities: people tend to like positive and happy music, at least when it comes to its lyrical content. However, the triumph of the naive estimator is quite a disappointing result, so further research on this topic would focus on which regressors should be included in the model and which should be dropped.

Besides these conclusions, my main objective with this project was to lay the groundwork for future data analysis of music. That is, to create a somewhat structured way to gather this data and quantify its features. Still, given that the Spotify API does not give us a way to access the number of times a song has been streamed, alternative ways to measure this variable should be explored. Otherwise, conducting further research will prove difficult.